ABSTRACT


MARKOVIAN DECISION PROCESSES WITH LIMITED STATE
OBSERVABILITY AND UNOBSERVABLE COSTS


James D. Steele, Ph.D.


Consider a finite-state finite-action Markovian Decision Process
for which the state space has been partitioned into subsets. The
decisionmaker can only observe the subset to which the states of the
process belong, and not the actual states of the process. In addition,
the costs are unobservable in the sense that the total discounted
cost is to be assessed at infinity. An approach to this problem,
which makes use of the probability distributions over the state
space, is developed.
MARKOVIAN DECISION PROCESSES WITH LIMITED STATE
OBSERVABILITY AND UNOBSERVABLE COSTS

James D. Steele, Ph.D.

The Rand Corporation, Santa Monica, California

INTRODUCTION 

Consider the situation in which a decisionmaker periodically
observes a process, at times $t = 0, 1, 2, \ldots$, and at each
observation classifies the process as being in one of a possible
number of states. In the first section of this paper, we will require
that the set of all possible states of the process be a finite set.
In the later sections, we will consider situations in which the set
of all possible states is uncountable. After each observation, the
decisionmaker chooses an action from a set of possible actions.
Throughout this paper, the set of all possible actions will be
assumed to be a finite set. At this point a cost, which depends on
the current state of the process and on the particular action chosen,
is incurred and the next state of the process is chosen according to
transition probabilities which depend on the current state and the
particular action chosen. The objective of the decisionmaker is to
choose actions in a manner such that some particular cost criterion
is minimized. Throughout this paper, the cost criterion used will be
the total expected discounted cost of operating over the infinite
future. The above basically describes a Markovian Decision Process
with the particular cost criterion as defined.

In the first section of this paper, we review some of the concepts
and definitions associated with the finite-state finite-action
Markovian Decision Process. We define the concept of a policy for
taking actions for the decisionmaker and we develop the expressions
for the expected discounted costs associated with the use of certain
types of policies. In this section, we assume that the decisionmaker
knows the current value of the state of the process at each
observation point. Also in this section, we assume that the
decisionmaker knows, immediately after observing the current state
and taking an action, the value of the cost incurred at that point.

In the remaining sections of this paper, we consider a finite-state
finite-action Markovian Decision Process in which the decisionmaker
is not told the exact state of the process at the observation points.
Rather, the decisionmaker is only told that the current state belongs
to a particular subset of possible states. We call this "limited
state observability." The extreme case in which the decisionmaker is
given no information about the current state (i.e. the subset to
which the current state belongs is simply taken to be the entire set
of all possible states of the process) is called "complete
unobservability." In this case, we say the Markovian Decision Process
has unobservable states. Also, in the remaining sections of this
paper, we assume that the decisionmaker is not told the value of the
costs incurred at any observation point; rather, we assume that the
decisionmaker will be told the value of each current cost far enough
into the future so that we may assume that the total cost will be
assessed at infinity. In this case, we say that the Markovian
Decision Process has "unobservable costs."

In this paper, we develop a methodology for analyzing this type
of situation. We then refer the reader to Steele [4], where some
mathematical results are developed for this problem. A rather complete
treatment for the two-state two-action case with unobservable states
and unobservable costs may be found in Steele [3].

FINITE-STATE FINITE-ACTION MARKOVIAN DECISION PROCESSES

Consider the Markovian Decision Process (MDP) defined by the
following objects:

State space $S = \{1, 2, 3, \ldots, N\}$, for finite $N$,

Action space $A = \{a_1, a_2, \ldots, a_M\}$, for finite $M$,

Cost set $C = \{C(i, a_k) : i \in S,\ a_k \in A\}$, where all costs
are taken to be finite,

Transition probabilities $\{q_{ij}(a_k) : i, j \in S,\ a_k \in A\}$,

Discount factor $\alpha$, such that $0 < \alpha < 1$.

At times $t = 0, 1, 2, \ldots$, a decisionmaker observes the current
state, $X_t \in S$, of the process. After observing the current state,
the decisionmaker then chooses an action $a_t \in A$ and incurs a cost
$C(X_t, a_t) \in C$. The next state of the process is then chosen
according to the transition probabilities $q_{X_t j}(a_t)$; that is,
the next state is $j$ with probability $q_{X_t j}(a_t)$.
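The dynamics just described can be illustrated with a small simulation. The following is only a minimal sketch and is not part of the original development: the two-state two-action instance, the names `COST`, `TRANS`, `ALPHA`, `step`, and `simulate`, and the zero-based labels for states and actions are all assumptions made for illustration.

```python
import random

# Hypothetical instance (not from the paper). States 0 and 1 stand for the
# paper's states 1 and 2; actions 0 and 1 stand for a_1 and a_2.
COST = {0: [0.0, 5.0],          # COST[a][i] = C(i, a)
        1: [2.0, 2.0]}
TRANS = {0: [[0.8, 0.2],        # TRANS[a][i][j] = q_ij(a)
             [0.0, 1.0]],
         1: [[1.0, 0.0],
             [1.0, 0.0]]}
ALPHA = 0.9                     # discount factor, 0 < ALPHA < 1


def step(state, action, rng):
    """Incur C(state, action), then draw the next state from row q_{state,.}(action)."""
    cost = COST[action][state]
    next_state = rng.choices([0, 1], weights=TRANS[action][state])[0]
    return cost, next_state


def simulate(policy, start, horizon, rng):
    """Discounted cost of one sample path of `horizon` steps under a map `policy`."""
    total, state = 0.0, start
    for t in range(horizon):
        cost, state = step(state, policy[state], rng)
        total += (ALPHA ** t) * cost
    return total
```

Averaging `simulate` over many sample paths with a long horizon would give a Monte Carlo estimate of the total expected discounted cost for the chosen policy and starting state.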

A policy for the decisionmaker will be defined as any rule for
taking actions at each observation point $t = 0, 1, 2, \ldots$. A
particular policy may be such that at each observation point, $t$, the
action taken, $a_t$, may depend on the entire observed sequence of
states and actions from time $t = 0$ up to and including the current
observation $X_t$. A policy will be called Markovian if at each point,
$t = 0, 1, 2, \ldots$, the action taken, $a_t$, depends on the current
state, $X_t$, of the process but does not depend on the observed
sequence of states and actions from time $t = 0$ up to and including
time $t - 1$. A particular policy may be randomized in the sense that
at each observation time $t = 0, 1, 2, \ldots$, the action $a_t$ is
chosen according to some random procedure. A particular policy, $W$,
will be called deterministic if at each observation point
$t = 0, 1, 2, \ldots$, there exists a map $f_t : S \to A$ such that the
policy $W$ chooses the current action $a_t$ according to the rule
$a_t = f_t(X_t)$. In other words, a deterministic policy may be defined
in terms of a sequence of maps from $S$ into $A$ by

$$W = (f_0, f_1, f_2, \ldots).$$

A particular policy, $W$, will be called stationary if there exists a
single map $f : S \to A$ such that at each observation point
$t = 0, 1, 2, \ldots$, the policy $W$ chooses the current action $a_t$
according to the rule $a_t = f(X_t)$. A stationary policy $W$ therefore
may be defined as $W = (f, f, f, f, \ldots)$. In this paper, we will
simply consider a stationary policy $W$ and its associated map $f$ as
being the same. Therefore, we say that a stationary policy for (MDP)
is a map $f : S \to A$.

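For any policy $W$, we define the total expected discounted cost
of starting in state $i$ at $t = 0$, and using the policy $W$ over the
infinite future, as

$$V_W(i) = E_W\left[\sum_{t=0}^{\infty} \alpha^t C(X_t, a_t) \,\middle|\, X_0 = i\right],$$

where $E_W$ is used to indicate the dependence of the conditional
expectation on the policy $W$. If $W$ is the stationary policy defined
in terms of the map $f$, then we note that $a_t = f(X_t)$ for every
$t$, so that

$$V_f(i) = E_f\left[\sum_{t=0}^{\infty} \alpha^t C(X_t, f(X_t)) \,\middle|\, X_0 = i\right].$$
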
Other criteria for minimization may be defined for the (MDP). However,
in this paper we will only consider the case where the decisionmaker
attempts to minimize the total expected discounted cost. Howard [2]
analyzed (MDP's) having finite-state and finite-action spaces and
proved that an optimal stationary policy (i.e. a stationary policy
which minimizes the total expected discounted cost) always exists. The
Howard Policy Improvement Routine is a method by which an optimal
stationary policy for (MDP) may be found.
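To make the idea of policy improvement concrete, the following is a minimal sketch in the spirit of the Howard routine, not a reproduction of it. It evaluates a stationary policy by repeated substitution into $V \leftarrow C_f + \alpha Q_f V$ (a simplification swapped in for an exact linear solve) and then switches to any action that strictly lowers the one-step cost. The function names, the data layout `cost[a][i]` and `trans[a][i][j]`, and the zero-based labels are assumptions made for illustration.

```python
def evaluate(policy, cost, trans, alpha, sweeps=2000):
    """Approximate the discounted cost of the stationary policy `policy`
    by iterating V <- C_f + alpha * Q_f V for a fixed number of sweeps."""
    n = len(policy)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [cost[policy[i]][i]
             + alpha * sum(trans[policy[i]][i][j] * v[j] for j in range(n))
             for i in range(n)]
    return v


def improve(cost, trans, alpha):
    """Alternate evaluation and improvement until the policy stops changing."""
    n = len(cost[0])
    policy = [0] * n
    while True:
        v = evaluate(policy, cost, trans, alpha)

        def one_step(i, a):
            # Cost of using action a now in state i and the current policy afterwards.
            return cost[a][i] + alpha * sum(trans[a][i][j] * v[j] for j in range(n))

        new_policy = []
        for i in range(n):
            best = policy[i]
            for a in range(len(cost)):
                # Switch only on a strict improvement, so ties cannot cause cycling.
                if one_step(i, a) < one_step(i, best) - 1e-12:
                    best = a
            new_policy.append(best)
        if new_policy == policy:
            return policy, v
        policy = new_policy


# Hypothetical instance: cost[a][i] = C(i, a), trans[a][i][j] = q_ij(a).
cost = [[0.0, 5.0], [2.0, 2.0]]
trans = [[[0.8, 0.2], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]]
policy, v = improve(cost, trans, 0.9)
```

Because the set of stationary policies is finite and a change is made only on strict improvement, the loop terminates for any finite instance of this kind.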

LIMITED STATE OBSERVABILITY AND UNOBSERVABLE COSTS

Consider the Markovian Decision Process (MDP) as previously
defined, having state space $S = \{1, 2, \ldots, N\}$ for finite $N$.
Let $h$ be any map defined on $S$ and having image in the set of
positive integers $\{1, 2, \ldots, N\}$. That is to say, that
$h : S \to h(S) = \{\lambda_1, \lambda_2, \ldots, \lambda_u\}$, where
$u \le N$ and $\lambda_j \subset S$ for $j = 1, 2, \ldots, u$, two
states belonging to the same $\lambda_j$ exactly when $h$ assigns them
the same value. In other words, each element of the set $h(S)$ is a
subset of $S$, or $h$ partitions $S$ into subsets.
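As a small illustration of how a map $h$ induces a partition, the sketch below groups states by their $h$-value and returns the subset that would be observed for a given state. The function names and the example map are hypothetical and are not taken from the paper.

```python
def partition_from_map(h):
    """Group the states by their h-value; each group is one observable subset."""
    blocks = {}
    for state, label in h.items():
        blocks.setdefault(label, set()).add(state)
    return [blocks[label] for label in sorted(blocks)]


def observe(state, blocks):
    """Return the subset to which `state` belongs."""
    for block in blocks:
        if state in block:
            return block
    raise ValueError("state is not covered by the partition")


# Hypothetical example with N = 4 states and u = 2 subsets.
h = {1: 1, 2: 1, 3: 2, 4: 2}
blocks = partition_from_map(h)
```

A constant map gives the single subset $S$ (no information about the state), while a one-to-one map gives $N$ singletons (the state is observed exactly).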


We now consider the situation where the decisionmaker cannot
observe the current state, $X_t \in S$, at each observation point
$t = 0, 1, 2, \ldots$; rather, the decisionmaker can only observe the
subset, $Y_t \in h(S)$, to which the current state $X_t$ belongs. We
say that the Markovian Decision Process, and hence the decisionmaker,
has limited state observability. The extreme case where $u = 1$, i.e.
$h(S) = \{\lambda_1\} = \{S\}$, represents the case of unobservable
states. The extreme case where $u = N$, i.e.
$h(S) = \{\lambda_1, \lambda_2, \ldots, \lambda_N\}$, represents the
case of complete state observability, which of course is simply the
case summarized earlier in this paper. Let $\mathcal{S}$ be the set of
all probability distributions over $S$, i.e.

$$\mathcal{S} = \left\{P = (P_1, P_2, \ldots, P_N) \in E_N : 0 \le P_i \le 1,\ \sum_{i=1}^{N} P_i = 1\right\},$$

where $E_N$ is $N$-dimensional Euclidean space, and we let $P_i$ be
the probability of being in state $i$. For each $j$ such that
$1 \le j \le N$, we let $e_j = (0, \ldots, 0, 1, 0, \ldots, 0) \in E_N$
be the probability vector having 0's in every place except the $j$th
place, where it has a 1. Next, we let $H(j_1, j_2, \ldots, j_r)$, for
$1 \le r \le N$ and $j_1, j_2, \ldots, j_r \in S$, be the convex hull
of the set $\{e_{j_1}, e_{j_2}, \ldots, e_{j_r}\}$.


We note that $H(j_1, j_2, \ldots, j_r) \subset \mathcal{S}$ and that

$H(j_1, j_2, \ldots, j_r)$ = {all probability vectors in $E_N$ having
0's in every place except the $j_1$th place, the $j_2$th place, ...,
the $j_r$th place}.


If $\lambda_K \in h(S)$ is defined by
$\lambda_K = \{\lambda_{K1}, \lambda_{K2}, \ldots, \lambda_{K n_K}\} \subset S$,
for $1 \le n_K \le N$, then we note that
$H(\lambda_{K1}, \lambda_{K2}, \ldots, \lambda_{K n_K})$, which for
convenience we will write as $H(\lambda_K)$, may be considered as being
the set of all probability distributions over $\lambda_K$. We may now
prove the following theorem.


THEOREM: Suppose that at any time $t = 0, 1, 2, \ldots$, we observe
$Y_t = \lambda_i$, for some $\lambda_i \in h(S)$, and we are told that
the current distribution over the state space $S$ is $P_t$. Suppose
then that we take the action $a_t \in A$. Then both the new observation
$Y_{t+1}$ and the new distribution $P_{t+1}$ depend only on the current
distribution $P_t$ and the current action $a_t$.

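Proof: We know that $Y_{t+1} \in h(S)$, i.e. $Y_{t+1} = \lambda_K$ for
some $K$ such that $1 \le K \le u$. The conditional probability of
$\lambda_K$, given $\lambda_i$, $P_t$, and $a_t$, is given by

$$\Pr\{Y_{t+1} = \lambda_K \mid Y_t = \lambda_i,\ P_t = P,\ a_t = a\} = \sum_{n \in \lambda_K} \sum_{m \in \lambda_i} q_{mn}(a)\, P_m,$$

where we have written $P_t = P = (P_1, P_2, \ldots, P_N)$. This proves
that $Y_{t+1}$ depends only on $P_t$ and $a_t$.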




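Next, we note that since we must observe $Y_{t+1} = \lambda_K$ for
some $K = 1, 2, \ldots, u$, then it follows that we must have
$P_{t+1} \in H(\lambda_K)$ for some $K = 1, 2, \ldots, u$. If we do
observe $Y_{t+1} = \lambda_K$, for some $K$, then we have
$P_{t+1} = P' = (P_1', P_2', \ldots, P_N')$, where the components are
given by

$$P_j' = \frac{\sum_{l \in \lambda_i} q_{lj}(a)\, P_l}{\sum_{n \in \lambda_K} \sum_{m \in \lambda_i} q_{mn}(a)\, P_m}, \quad \text{for } j \in \lambda_K,$$

$$P_j' = 0, \quad \text{for } j \notin \lambda_K.$$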

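We shall write

$$q(P', \lambda_K \mid P, \lambda_i, a) = \sum_{n \in \lambda_K} \sum_{m \in \lambda_i} q_{mn}(a)\, P_m,$$

and we note that

$$q(P', \lambda_K \mid P, \lambda_i, a) = \Pr\{\lambda_K \mid \lambda_i, P, a\}.$$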


Also, we find that

$$\sum_{\lambda_K \in h(S)} q(P', \lambda_K \mid P, \lambda_i, a) = 1.$$



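Therefore, we have that, for those $\lambda_K$ such that
$q(P', \lambda_K \mid P, \lambda_i, a) \ne 0$, the $P'$ will be given,
with probability $q(P', \lambda_K \mid P, \lambda_i, a)$, by
$P' = (P_1', \ldots, P_N')$, where

$$P_j' = q(P', \lambda_K \mid P, \lambda_i, a)^{-1} \sum_{l \in \lambda_i} q_{lj}(a)\, P_l, \quad \text{for } j \in \lambda_K,$$

$$P_j' = 0, \quad \text{for } j \notin \lambda_K.$$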


This, together with the fact that

$$\sum_{\lambda_K \in h(S)} q(P', \lambda_K \mid P, \lambda_i, a) = 1,$$

shows that $P_{t+1}$ depends only on $P_t$ and $a_t$.


Q.E.D.
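The update used in the proof can be sketched in a few lines. This is an illustrative sketch only: the function name `update`, the zero-based state labels, and the three-state example are assumptions. Because the current distribution $P$ already vanishes outside the observed subset $\lambda_i$, the code sums over all states $m$ rather than over $m \in \lambda_i$; the two sums agree.

```python
def update(P, Q_a, block):
    """Return (probability of next observing `block`, posterior P').

    P     : current distribution over states, as a list
    Q_a   : transition matrix for the chosen action, Q_a[m][j] = q_mj(a)
    block : the subset lambda_K, as a set of zero-based state indices
    """
    n = len(P)
    # Unnormalized probability mass arriving at each state j inside the block.
    mass = [sum(P[m] * Q_a[m][j] for m in range(n)) if j in block else 0.0
            for j in range(n)]
    prob = sum(mass)
    if prob == 0.0:
        return 0.0, None        # this observation cannot occur
    return prob, [x / prob for x in mass]


# Hypothetical example: three states, subsets {0, 1} and {2},
# current distribution concentrated on the subset {0, 1}.
P = [0.5, 0.5, 0.0]
Q_a = [[0.5, 0.25, 0.25],
       [0.0, 0.5, 0.5],
       [0.0, 0.0, 1.0]]
prob_01, post_01 = update(P, Q_a, {0, 1})
prob_2, post_2 = update(P, Q_a, {2})
```

In this example the two observation probabilities sum to one, as the proof requires, and each posterior places all of its mass on the observed subset.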


Next, we write $Q(a) = [q_{ij}(a)]$ for the matrix of transition
probabilities associated with the action $a$. We let $Q(a)_{\lambda_K}$
represent the matrix derived from $Q(a)$ by replacing the columns of
$Q(a)$ that are not associated with elements of $\lambda_K$ with
columns of zeros. We let $\mathbf{1} \in E_N$ represent the vector
having all components equal to one. We may now write

$$q(P', \lambda_K \mid P, \lambda_i, a) = \langle P\,Q(a)_{\lambda_K}, \mathbf{1} \rangle,$$

as an inner product in $E_N$. We also have $P' \in H(\lambda_K)$ given by

$$P' = \langle P\,Q(a)_{\lambda_K}, \mathbf{1} \rangle^{-1}\, P\,Q(a)_{\lambda_K},$$

with probability $\langle P\,Q(a)_{\lambda_K}, \mathbf{1} \rangle$, when
this probability is not equal to zero.
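The matrix form can be checked numerically against the componentwise formulas. The sketch below is an assumption-laden illustration: the helper names `mask_columns`, `vec_mat`, `inner`, and `update_matrix_form`, the zero-based labels, and the example data are hypothetical.

```python
def mask_columns(Q_a, block):
    """Q(a)_{lambda_K}: zero every column of Q(a) not belonging to the block."""
    return [[q if j in block else 0.0 for j, q in enumerate(row)] for row in Q_a]


def vec_mat(P, M):
    """Row vector times matrix."""
    return [sum(P[m] * M[m][j] for m in range(len(P))) for j in range(len(M[0]))]


def inner(x, y):
    """Inner product in E_N."""
    return sum(a * b for a, b in zip(x, y))


def update_matrix_form(P, Q_a, block):
    """Observation probability <P Q(a)_block, 1> and the normalized posterior."""
    PQ = vec_mat(P, mask_columns(Q_a, block))
    prob = inner(PQ, [1.0] * len(PQ))
    post = [x / prob for x in PQ] if prob > 0.0 else None
    return prob, post


# Hypothetical three-state example with subsets {0, 1} and {2}.
P = [0.5, 0.5, 0.0]
Q_a = [[0.5, 0.25, 0.25],
       [0.0, 0.5, 0.5],
       [0.0, 0.0, 1.0]]
prob, post = update_matrix_form(P, Q_a, {0, 1})
```

Zeroing columns before multiplying keeps only the mass that lands in the observed subset, which is why a single normalization by the inner product with the all-ones vector recovers the posterior.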


We now assume that the decisionmaker cannot observe the current
cost $C(X_t, a_t)$ at any observation point $t = 0, 1, 2, \ldots$. We
note that if the current distribution, $P_t$, is known, then we may
compute the current value of the expected cost
$\langle P_t, C(a_t) \rangle$, where
$C(a_t) = (C(1, a_t), C(2, a_t), \ldots, C(N, a_t))$ is the vector of
costs associated with the action $a_t$.


We may now define the new Markovian Decision Process
($\widetilde{\mathrm{MDP}}$) by the following objects.

State space $\mathcal{S}$ = {all probability distributions over $S$},
with supplementary information given by observations in $h(S)$.

Action space $\tilde{A} = A = \{a_1, a_2, \ldots, a_M\}$

Transition probabilities
$\{q(P', \lambda_K \mid P, \lambda_i, a) : P', P \in \mathcal{S};\ \lambda_K, \lambda_i \in h(S);\ a \in A\}$

Cost set $\{\langle P, C(a) \rangle : P \in \mathcal{S},\ a \in A\}$

Discount factor $\alpha$, such that $0 < \alpha < 1$.

We note that ($\widetilde{\mathrm{MDP}}$) has uncountable state-space
and finite-action space.
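One way to see how these objects fit together is a single one-step evaluation on distributions: the expected current cost $\langle P, C(a) \rangle$ plus the discounted, probability-weighted value of each possible next distribution. The sketch below is not taken from the paper; the function names, the caller-supplied value function `V`, the zero-based labels, and the two-state instance are all hypothetical.

```python
def update(P, Q_a, block):
    """Probability of observing `block` next, and the resulting distribution."""
    n = len(P)
    mass = [sum(P[m] * Q_a[m][j] for m in range(n)) if j in block else 0.0
            for j in range(n)]
    prob = sum(mass)
    return (prob, [x / prob for x in mass]) if prob > 0.0 else (0.0, None)


def backup(P, costs, trans, blocks, alpha, V):
    """Minimize over actions: <P, C(a)> + alpha * sum_K q(P', K | P, a) * V(P')."""
    best_value, best_action = None, None
    for a in range(len(costs)):
        value = sum(p * c for p, c in zip(P, costs[a]))      # <P, C(a)>
        for block in blocks:
            prob, post = update(P, trans[a], block)
            if prob > 0.0:                                   # skip impossible observations
                value += alpha * prob * V(post)
        if best_value is None or value < best_value:
            best_value, best_action = value, a
    return best_value, best_action


# Hypothetical instance: costs[a][i] = C(i, a), trans[a][i][j] = q_ij(a),
# with complete observability (each state is its own subset).
costs = [[0.0, 5.0], [2.0, 2.0]]
trans = [[[0.8, 0.2], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]]
blocks = [{0}, {1}]
myopic = backup([0.5, 0.5], costs, trans, blocks, 0.9, lambda P: 0.0)
shifted = backup([0.5, 0.5], costs, trans, blocks, 0.9, lambda P: 10.0)
```

With a constant value function the continuation term adds the same amount for every action, since the observation probabilities sum to one; only the expected current cost then distinguishes the actions.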

The class of Markovian Decision Processes to which
($\widetilde{\mathrm{MDP}}$) belongs has been analyzed by Blackwell [1].
His analysis shows that an optimal stationary policy (i.e. a map
$f : \mathcal{S} \to A$ which minimizes the total expected discounted
cost) for this class of problems always exists, and that the Howard
Policy Improvement Routine may be extended to this class of problems.
However, in the finite-state finite-action class of problems, the set
of stationary policies is finite, and therefore the Howard Policy
Improvement Routine will produce an optimal policy in a finite number
of steps. In the uncountable-state finite-action class of problems,
the set of stationary policies is uncountable, and therefore the
Howard Policy Improvement Routine will not in general produce an
optimal policy in a finite number of steps.

A preliminary analysis of ($\widetilde{\mathrm{MDP}}$) is presented in
Steele [4]. The analysis is developed for
$h(S) = \{\lambda_1\} = \{S\}$. However, the results apply to the
cases in which $h$ is arbitrary.
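In the case $h(S) = \{S\}$ only one observation is possible, so the update in the theorem reduces to $P_{t+1} = P_t Q(a_t)$ with probability one. The sketch below uses this to compare fixed action sequences over a short truncated horizon by brute force. It is an illustration under stated assumptions, not the analysis of Steele [4]: the truncation, the exhaustive search, the function names, and the two-state instance are all hypothetical.

```python
from itertools import product


def propagate(P, Q_a):
    """P_{t+1} = P_t Q(a): the update when the only observation is S itself."""
    n = len(P)
    return [sum(P[m] * Q_a[m][j] for m in range(n)) for j in range(n)]


def sequence_cost(P0, seq, costs, trans, alpha):
    """Discounted expected cost sum_t alpha^t <P_t, C(a_t)> of a fixed action sequence."""
    total, P = 0.0, list(P0)
    for t, a in enumerate(seq):
        total += (alpha ** t) * sum(p * c for p, c in zip(P, costs[a]))
        P = propagate(P, trans[a])
    return total


def best_sequence(P0, horizon, costs, trans, alpha):
    """Exhaustively search all action sequences of the given length."""
    return min(product(range(len(costs)), repeat=horizon),
               key=lambda seq: sequence_cost(P0, seq, costs, trans, alpha))


# Hypothetical instance: costs[a][i] = C(i, a), trans[a][i][j] = q_ij(a).
costs = [[0.0, 5.0], [2.0, 2.0]]
trans = [[[0.8, 0.2], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]]
seq = best_sequence([0.5, 0.5], 3, costs, trans, 0.9)
value = sequence_cost([0.5, 0.5], seq, costs, trans, 0.9)
```

The search grows exponentially with the horizon, so it serves only to illustrate the deterministic evolution of the distribution in the unobservable case.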